## Data Science Africa 2017 - Prerequisite software

The [Data Science Africa 2017 summer school](http://www.datascienceafrica.org/dsa2017/) will mainly cover the broad areas of Machine Learning and Data Science. This is the third summer school on data science, following the previous ones held at Makerere in Kampala in 2016 and at Dedan Kimathi University of Technology in 2015.

To ensure we hit the ground running, it is essential that you install the prerequisite software, test it, and make sure it works on your computer. The venue for the summer school will have some computers with the software already installed, but you are advised to bring your own laptop with the software installed.

Luckily, all the software required has already been prepackaged in a bundle called [Anaconda](https://www.continuum.io/downloads). You can download the version that matches your laptop's OS and architecture from the [Anaconda website](https://www.continuum.io/downloads). Please download the **Python 3.x** version; the notebooks may not work with Python 2!

### Testing the software
If you installed the Anaconda software properly, you should be reading this :-). If you are reading this on a colleague's machine, you are advised to ask the colleague who owns it how he/she got it installed.

We will mainly be using the Jupyter notebook for the lab sessions, so please get familiar with it. To ensure everything is working, follow the instructions below and run the corresponding commands. The goal is to reach the end of this process with no errors.
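Before moving on, a quick sanity check of the installation can help. The snippet below is only a sketch: the helper name `check_environment` is made up for this illustration, and the list of libraries is taken from the imports used later in this notebook. It reports whether Python 3 is running and which of the libraries cannot be found.

```python
import importlib.util
import sys


def check_environment(required=("pandas", "matplotlib", "seaborn")):
    """Return (python_ok, missing): whether Python 3 is running and
    which of the required libraries could not be found."""
    python_ok = sys.version_info[0] >= 3
    # find_spec returns None when a top-level module is not installed
    missing = [name for name in required
               if importlib.util.find_spec(name) is None]
    return python_ok, missing


python_ok, missing = check_environment()
print("Running Python 3:", python_ok)
print("Missing libraries:", missing if missing else "none")
```

If anything is listed as missing, reinstalling Anaconda (Python 3.x version) is usually the simplest fix.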

### Reading in data using Pandas
[Pandas](http://pandas.pydata.org) is a Python library that is particularly good for data-related tasks. We will use several other Python libraries as well. You are advised to go through the [scipy-lectures (sections 1.1 - 1.4 and 3.1)](http://www.scipy-lectures.org) to get a good grounding in the libraries we will be using.

To test that everything is working fine:
1. Go to the website and download the [Production of major crops by district, UCA 2008 / 2009](http://catalog.data.ug/dataset/production-of-major-crops-by-district-uca-2008-2009) data. It includes Excel sheets for the 4 regions of Uganda, each detailing production figures for the major crops.
2. Use Pandas to read in the first sheet of data from the Excel document as shown below.
3. Work through some of the example analyses shown and complete the questions that follow.
4. Go get a cold beer, drink 3/4 of it and inform a friend that you are now ready for the data science summer school 2017 !

In [None]:
# Import the relevant libraries to this notebook
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
datafilename = "productionmajorcropsregio20082009.xls" # this file must be in your working directory (or give the full path)

# Read in data, skipping some rows at the top and the last row at the bottom that has the sub-totals
# Note: pandas expects the keyword arguments sheet_name and skipfooter here
data = pd.read_excel(datafilename, sheet_name="CENTRAL", skiprows=2, skipfooter=1, thousands=',')
data.head() # show the first 5 rows of the data. try data.head(20)
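To see what skipping header rows, skipping the footer row, and the `thousands` option do, here is a small sketch on made-up data. It uses `pd.read_csv` on an in-memory text instead of the Excel file, so it runs without the download; the district names and figures are invented for illustration only.

```python
import io

import pandas as pd

# Made-up text standing in for a sheet: two title lines, a header row,
# two data rows and a sub-total row at the bottom.
raw = """Some title line
Another title line
District;Maize;Cassava
A;1,200;300
B;2,500;1,100
Total;3,700;1,400
"""

toy = pd.read_csv(
    io.StringIO(raw),
    sep=";",
    skiprows=2,        # drop the two title lines above the header
    skipfooter=1,      # drop the sub-total row at the bottom
    thousands=",",     # parse "1,200" as the number 1200
    engine="python",   # skipfooter is only supported by the python engine
)
print(toy)
print(toy["Maize"].sum())
```

Without `skipfooter=1`, the sub-total row would be counted as a district and every sum would be doubled.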

In [None]:
# You can select a sub dataset
data_maize_cassava = data[['District', 'Maize', 'Cassava']]
data_maize_cassava.head()

In [None]:
# Get total maize production for the Central region
# The sub-totals row was already skipped when the file was read, so a plain sum is enough
print("Maize production for Central region is {} Metric tonnes".format(data['Maize'].astype('float32').sum()))

In [None]:
# Plot cassava production by district
data.plot(x='District', y='Cassava', kind='bar')

### For you to do

In [None]:
# Write your code to complete the following functions

def read_allregionsdata(datafilename):
    # Read in all the data from the different regions
    # from the excel sheet "datafilename"
    
    # WRITE YOUR CODE HERE
    data_central = 0
    data_eastern = 0
    data_northern = 0
    data_western = 0
    
    return data_central, data_eastern, data_northern, data_western

def get_total_sp_eastern(data_eastern):
    # Get the total tonnage of Sweet potatoes from the Eastern region
    
    # WRITE YOUR CODE HERE (AND REPLACE THE DEFAULT VALUE FOR total_sp..)
    total_sp = 0
    
    return total_sp
    
def get_max_prod_western(data_western):
    # Get the maximum production of Finger millet in the western region
    # This is not the total production but the maximum production of a district
    
    # WRITE YOUR CODE HERE
    max_prod = 0
    
    return max_prod

def get_dist_max_prod_western(data_western):
    # Get the district with the maximum production of Finger millet in the western region
    
    # WRITE YOUR CODE HERE
    max_prod_dist = ""
    
    return max_prod_dist
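As a hint for the functions above (not a solution), the sketch below shows on a made-up table how to get a column total, a column maximum, and the row that holds the maximum. The `toy` DataFrame and its column names are invented; the real sheets use their own column names, which you should check with `data.columns`.

```python
import pandas as pd

# Invented data, only to illustrate the pandas calls
toy = pd.DataFrame({
    "District": ["A", "B", "C"],
    "Millet": [10.0, 45.0, 30.0],
})

total = toy["Millet"].sum()            # total over all districts
largest = toy["Millet"].max()          # largest single-district value
row_label = toy["Millet"].idxmax()     # index label of the row with the maximum
top_district = toy.loc[row_label, "District"]

print(total, largest, top_district)
```

`idxmax` returns an index label rather than a district name, which is why the extra `.loc` lookup is needed.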



In [None]:
# Testing...
data_central, data_eastern, data_northern, data_western = read_allregionsdata(datafilename)
assert get_total_sp_eastern(data_eastern) == 312405.0, "Sorry wrong answer, minus 1 beer :-("
assert get_max_prod_western(data_western) == 9674.0, "Answer feels wrong, this is the western region we are talking about"
assert get_dist_max_prod_western(data_western).strip() == "Nakasongola", "Naah, that district was abolished last year !"
print("Congratulations, all is working just fine !!!")

### Extra Bonus questions (in some instances "bonus" can be replaced with "beer")

In [None]:
def combine_alldatasets(data_central, data_western, data_eastern, data_northern):
    # Combine all the regional datasets into one national dataset
    
    # WRITE YOUR CODE HERE
    nationalData = 0
    
    return nationalData

def get_total_national_cassava_prod(nationalData):
    # Attempt this after the previous question
    # Get the total national production for cassava (tons) for 2008-2009
    
    # WRITE YOUR CODE HERE
    total_cassava = 0
    
    return total_cassava

In [None]:
d = combine_alldatasets(data_central, data_western, data_eastern, data_northern)
assert get_total_national_cassava_prod(d) == 1.63924e+06, "Comeon man, cassava is a big deal - much more than that !"
print("Congratulations, you get 2-tickets to beerville; all you can drink :-)")